Abstract

We design a linear time approximation scheme for the Gale-Berlekamp Switching Game 
and generalize it to a wider class of dense fragile minimization problems including the Nearest 
Codeword Problem (NCP) and Unique Games Problem. Further applications include, among 
other things, finding a constrained form of matrix rigidity and maximum likelihood decoding 
of an error correcting code. As another application of our method we give the first linear 
time approximation schemes for correlation clustering with a fixed number of clusters and its 
hierarchical generalization. Our results depend on a new technique for dealing with small 
objective function values of optimization problems and could be of independent interest. 
1 Introduction



The Gale-Berlekamp Switching Game (GB Game) was introduced independently by Elwyn Berlekamp
[10, 23] and David Gale [23] in the context of coding theory. The game is played on an $m \times m$
grid of lightbulbs. The adversary chooses an arbitrary subset of the lightbulbs to be initially "on."
Next to every row (resp. column) of lightbulbs is a switch, which can be used to invert the state of
every lightbulb in that row (resp. column). The protagonist's task is to minimize the number of lit
lightbulbs (by flipping switches). This problem was proven very recently to be NP-hard [21]. Let
$\Phi = \{-1, 1\} \subset \mathbb{R}$. For matrices $M, N$ let $d(M,N)$ denote the number of entries where $M$ and $N$
differ. It is fairly easy to see that the GB Game is equivalent to the following natural problems [21]:

• Given a matrix $M \in \Phi^{m \times m}$, find column vectors $x, y \in \Phi^m$ minimizing $d(M, xy^T)$.

• Given a matrix $M \in \Phi^{m \times m}$, find a rank-1 matrix $N \in \Phi^{m \times m}$ minimizing $d(M, N)$.

• Given a matrix $M \in \mathbb{F}_2^{m \times m}$, find $x, y \in \mathbb{F}_2^m$ minimizing $\sum_{ij} \mathbb{1}(M_{ij} \neq x_i \oplus y_j)$, where $\mathbb{F}_2$ is the
finite field with two elements and addition operator $\oplus$.

• Given a matrix $M \in \Phi^{m \times m}$, find column vectors $x, y \in \Phi^m$ maximizing $x^T M y$.

We focus on the equivalent minimization versions and prove existence of linear-time approxi- 
mation schemes for them. 
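
The equivalence between the first and the last formulation rests on the identity $x^T M y = m^2 - 2\,d(M, xy^T)$: every entry contributes $+1$ when $M_{ij} = x_i y_j$ and $-1$ otherwise. The following is a minimal Python sketch of that check on a tiny instance; the function names and the example matrix are illustrative assumptions, not part of the paper, and the brute-force search is exponential in $m$.

```python
from itertools import product

def gb_cost(M, x, y):
    """d(M, x y^T): number of entries where M differs from the rank-1 matrix x y^T."""
    m = len(M)
    return sum(M[i][j] != x[i] * y[j] for i in range(m) for j in range(m))

def bilinear(M, x, y):
    """x^T M y for +/-1 vectors x and y."""
    m = len(M)
    return sum(x[i] * M[i][j] * y[j] for i in range(m) for j in range(m))

def brute_force_gb(M):
    """Exhaustive minimum of d(M, x y^T); exponential in m, so tiny inputs only."""
    signs = list(product((-1, 1), repeat=len(M)))
    return min(gb_cost(M, x, y) for x in signs for y in signs)

# Hypothetical 3x3 instance with entries in {-1, +1}.
M = [[1, -1, 1],
     [-1, -1, 1],
     [1, 1, 1]]
m = len(M)
signs = list(product((-1, 1), repeat=m))

# Check x^T M y = m^2 - 2 d(M, x y^T) for every pair of switch settings.
identity_holds = all(bilinear(M, x, y) == m * m - 2 * gb_cost(M, x, y)
                     for x in signs for y in signs)
```

Because the two objectives differ by an affine map with negative slope, a minimizer of $d(M, xy^T)$ is a maximizer of $x^T M y$ and vice versa, although approximation ratios do not transfer between the two versions.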

Theorem 1. For every $\epsilon > 0$ there is a randomized $(1+\epsilon)$-approximation algorithm for the Gale-
Berlekamp Switching Game (its minimization version) with runtime $O(m^2) + 2^{O(1/\epsilon^2)}$.

In order to achieve the linear-time bound of our algorithms, we introduce two new techniques: 
calling the additive error approximation algorithm at the end of our algorithm and greedily refining 
the random sample used by the algorithm. These new methods could also be of independent interest. 

A constraint satisfaction problem (CSP) consists of $n$ variables over a domain of constant size $d$
and a collection of arity-$k$ constraints ($k$ constant). The objective of MIN-$k$CSP (MAX-$k$CSP) is to
minimize the number of unsatisfied (maximize the number of satisfied) constraints. An (everywhere)
dense instance is one where every variable is involved in at least a constant times the maximum
possible number of constraints, i.e. $\Omega(n^{k-1})$. For example, the GB Game is a dense MIN-2CSP
since each of the $n = 2m$ variables is involved in precisely $m = n/2$ constraints. It is natural to
consider generalizing Theorem 1 to all dense MIN-CSPs, but unfortunately many such problems
have no PTASs unless P=NP [7], so we must look at a restricted class of MIN-CSPs. A constraint
is fragile if modifying any variable in a satisfied constraint makes the constraint unsatisfied. A
CSP is fragile if all of its constraints are. Clearly the GB Game can be modeled as a fragile dense
MIN-2CSP. Our results generalize to all dense fragile MIN-$k$CSPs.
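
To make the fragility condition concrete, the sketch below tests it by brute force for a few small constraint types. It is only an illustration under assumed encodings: the predicate names and the particular permutation are hypothetical, and a constraint is represented as a Python predicate over a tuple of values.

```python
from itertools import product

def is_fragile(satisfied, arity, domain):
    """Return True if changing any one variable of any satisfying tuple breaks the constraint."""
    for values in product(domain, repeat=arity):
        if not satisfied(values):
            continue
        for pos in range(arity):
            for new in domain:
                if new == values[pos]:
                    continue
                changed = values[:pos] + (new,) + values[pos + 1:]
                if satisfied(changed):
                    return False
    return True

def parity(v):
    """GB-style equation x_u XOR x_v = 1 over F_2."""
    return (v[0] ^ v[1]) == 1

def or_clause(v):
    """Ordinary 2-SAT clause (x_u OR x_v)."""
    return bool(v[0] or v[1])

def conjunction(v):
    """DNF term (x_u AND NOT x_v)."""
    return v[0] == 1 and v[1] == 0

def permutation(v):
    """Unique-games style constraint x_u = pi(x_v) with the assumed pi(c) = c + 1 mod 3."""
    return v[0] == (v[1] + 1) % 3
```

Parity equations, conjunctions and permutation constraints pass the test, while an ordinary disjunctive clause does not, since a clause with two true literals stays satisfied after one of them is flipped.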

We now formulate our general theorem. 

Theorem 2. For every $\epsilon > 0$ there is a randomized $(1+\epsilon)$-approximation algorithm for dense fragile
MIN-$k$CSPs with runtime $O(n^k) + 2^{O(1/\epsilon^2)}$.

Any approximation algorithm for MIN-$k$CSP must read (by an adversary argument) the entire
input to distinguish between instances with optimal value 1 and 0, and hence the $O(n^k)$ term
of the runtime cannot be improved. It is fairly easy to see that improving the second term (to
$2^{o(1/\epsilon^2)}$) would imply an $O(n^2) + 2^{o(1/\epsilon^2)}$-time PTAS for average-dense MAX-CUT. Over a decade's
worth of algorithms El QI1 [21 [20] for MAX-$k$CSP all have dependence on $\epsilon$ of at best $2^{O(1/\epsilon^2)}$, so
any improvement to the runtime of Theorem 2 would be surprising.
We begin exploring applications of Theorem 2 by generalizing the Gale-Berlekamp game to
higher dimensions $k$ ($k$-ary GB) and then to arbitrary $k$-ary equations. Given $n$ variables $x_i \in \mathbb{F}_2$
and $m$ linear equations of the form $x_{i_1} \oplus x_{i_2} \oplus \cdots \oplus x_{i_k} = 0$ (or $= 1$), the $k$-ary Nearest Codeword
Problem (NCP) consists of finding an assignment minimizing the number of unsatisfied equations.
As the name suggests, the Nearest Codeword Problem can be interpreted as maximum likelihood
decoding for linear error correcting codes. The Nearest Codeword Problem has fragile constraints
so Theorem 2 implies a linear-time PTAS for the $k$-ary GB problem and the dense $k$-ary Nearest
Codeword Problem.

The Unique Games Problem (UGP) [12, 18] consists of solving MIN-2CSPs where the constraints
are permutations over a finite domain $D$ of colors; i.e. a constraint involving variables $x_u$ and $x_v$ is
satisfied iff $x_u = \pi_{uv}(x_v)$ for permutation $\pi_{uv}$. These constraints are clearly fragile, so Theorem 2
also implies a linear-time PTAS for the dense Unique Games Problem (with a constant number of
colors).

The multiway cut problem, also known as MIN-$d$CUT, consists of coloring an undirected graph
with $d$ colors, such that the $i$th of $d$ terminal nodes is colored with color $i$, minimizing the number
of bichromatic edges. The requirement that the terminal nodes must be colored with particular colors
does not fit in our dense fragile MIN-CSP framework, so we use a work-around: let the constraint
corresponding to an edge be satisfied only if it is monochromatic and the endpoint(s) that are
terminals (if any) are colored correctly.

As another application, consider MIN-$k$SAT, the problem of minimizing the number of satisfied
clauses of a boolean expression in conjunctive normal form where each clause has $k$ variables (some
negated). We consider the equivalent problem of minimizing the number of unsatisfied conjunctions
of a boolean expression in disjunctive normal form. A conjunction can be represented as a fragile
constraint indicating that all of the negated variables within that constraint are false and the
remainder are true, so Theorem 2 applies to MIN-$k$SAT as well.

Finally we consider correlation clustering and hierarchical clustering with a fixed number of
clusters [17, 1]. Correlation clustering consists of coloring an undirected graph with $d$ colors (like
multiway cut so far), minimizing the sum of the number of cut edges and the number of uncut
non-edges. Correlation clustering with two clusters is equivalent to the following symmetric variant
of the Gale-Berlekamp game: given a symmetric matrix $M \in \Phi^{m \times m}$ find a column vector $x \in \Phi^m$
minimizing $d(M, xx^T)$. Like the GB game, correlation clustering with 2 clusters is fragile and
Theorem 2 gives a linear-time approximation scheme. For $d > 2$ correlation clustering is not fragile
but has properties allowing for a PTAS anyway. We also solve a generalization of correlation
clustering called hierarchical clustering [1]. We prove the following theorem.

Theorem 3. For every $\epsilon > 0$ there is a randomized $(1+\epsilon)$-approximation algorithm for correlation
clustering and hierarchical clustering with fixed number of clusters $d$ with running time $n^2 2^{O(d^6/\epsilon^2)}$.

The above result improves on the running time $n^{O(9^d/\epsilon^2)} \log n = n^{O(9^d/\epsilon^2)}$ of the previous
PTAS for correlation clustering by Giotis and Guruswami [17] in two ways: first the polynomial is
linear in the size of the input and second the exponent is polynomial in $d$ rather than exponential.
Our result for hierarchical clustering with a fixed number of clusters is the first PTAS for that 
problem. 

We prove Theorem 2 in Sections 2 and 3 and Theorem 3 in Sections 4 and 5.
Related Work 

Elwyn Berlekamp built a physical model of the GB game with either $m = 8$ or $m = 10$ [10, 23] at
Bell Labs in the 1960s motivated by the connection with coding theory and the Nearest Codeword
Problem. Several works [15, 10] investigated the cost of worst-case instances of the GB Game; for
example the worst-case instance for $m = 10$ has cost 35 [10]. Roth and Viswanathan [21] showed
very recently that the GB game is in fact NP-hard. They also give a linear-time algorithm if the
input is generated by adding random noise to a cost zero instance. Replacing $\Phi$ with $\mathbb{R}$ in the second
formulation of the GB Game yields the problem of computing the 1-rigidity of a matrix. Lower
bounds on matrix rigidity have applications to circuit and communication complexity [19].

The Nearest Codeword Problem is hard to approximate in general [HE] to within a factor better than $n^{\Omega(1/\log\log n)}$.
It is hard even if each equation has exactly 3 variables and each variable appears in exactly 3 equations [9].
There is an $O(n/\log n)$ approximation algorithm [8, 3].

Over a decade ago two groups [BJ [13] independently discovered polynomial-time approximation 
algorithms for MAX-CUT achieving additive error of $\epsilon n^2$, implying a PTAS for average-dense MAX-CUT
instances. The fastest algorithms [2, 20] have constant runtime $2^{O(1/\epsilon^2)}$ for approximating
the value of any MAX-$k$CSP over a binary domain $D$. This can be generalized to an arbitrary
domain $D$. To see this, note that we can code $D$ in binary and correspondingly enlarge the arity of
the constraints to $k\lceil\log|D|\rceil$. A random sample of $O(1/\epsilon^4)$ variables suffices to achieve an additive
approximation [H [201 [22]. These results extend to MAX-BISECTION [13].

Arora, Karger and Karpinski [6] introduced the first PTASs for dense minimum constraint
satisfaction problems. They give PTASs with runtime $n^{O(1/\epsilon^2)}$ [6] for min bisection and multiway
cut (MIN-$d$-CUT). Bazgan, Fernandez de la Vega and Karpinski [7] designed PTASs for MIN-SAT
and the nearest codeword problem with runtime $n^{O(1/\epsilon^2)}$. Giotis and Guruswami [17] give
a PTAS for correlation clustering with $d$ clusters with runtime $n^{O(9^d/\epsilon^2)}$. We give linear-time
approximation schemes for all of the problems mentioned in this paragraph except for the
MIN-BISECTION problem.

2 Fragile-dense Algorithm 
2.1 Intuition 

Consider the following scenario. Suppose that our nemesis, who knows the optimal solution to
the Gale-Berlekamp problem shown in Figure 1, gives us a constant-size random sample of it to
tease us. How can we use this information to construct a good solution? One reasonable strategy
is to set each variable greedily based on the random sample. Throughout this section we will
focus on the row variables; the column variables are analogous. For simplicity our example has
the optimal solution consisting of all of the switches in one position, which we denote by $\alpha$. For
row $v$, the greedy strategy, resulting in assignment $x^{(1)}$, is to set switch $v$ to $\alpha$ iff $\hat{b}(v,\alpha) < \hat{b}(v,\beta)$,
where $\hat{b}(v,\alpha)$ (resp. $\hat{b}(v,\beta)$) denotes the number of light bulbs in the intersection of row $v$ and the
sampled columns that would be lit if we set the switch to position $\alpha$ (resp. $\beta$).

With a constant-size sample we can expect to set most of the switches correctly but a constant
fraction of them will elude us. Can we do better? Yes, we simply do greedy again. The greedy
prices analogous to $\hat{b}$ are shown in the columns labeled with $b$ in the middle of Figure 1. For
the example at hand, this strategy works wonderfully, resulting in us reconstructing the optimal
solution exactly, as evidenced by $b(x^{(1)},v,\alpha) < b(x^{(1)},v,\beta)$ for all $v$. In general this does not
reconstruct the optimal solution but provably gives something close.

Some of the rows, e.g. the last one, have $b(x^{(1)},v,\alpha)$ much less than $b(x^{(1)},v,\beta)$ while other
rows, such as the first, have $b(x^{(1)},v,\alpha)$ and $b(x^{(1)},v,\beta)$ closer together. We call variables with
$|b(x^{(1)},v,\alpha) - b(x^{(1)},v,\beta)| > \Theta(n)$ clearcut. Intuitively, one would expect the clearcut rows to be
more likely correct than the nearly tied ones. In fact, we can show that we get all of the clearcut
ones correct, so the remaining problem is to choose values for the rows that are close to tied.
Figure 1: An illustration of our algorithmic ideas on the Gale-Berlekamp Game.



However, those rows have a lot of lightbulbs lit, suggesting that the optimal value is large, so it is 
reasonable to run an additive approximation algorithm and use that to set the remaining variables. 

Finally observe that we can simulate the random sample given by the nemesis by simply taking 
a random sample of the variables and then doing exhaustive search of all possible assignments of
those variables. We have just sketched our algorithm. 
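
The two greedy steps can be sketched as follows for the $\pm 1$ formulation of the game. This is only a toy illustration under stated assumptions: the planted instance, the sample and all names are hypothetical, ties are broken toward $+1$, and the "nemesis" is simulated by reading the planted solution on the sampled indices instead of enumerating all assignments of the sample.

```python
def greedy_rows(M, y, cols):
    """For each row pick the switch sign that leaves fewer lit bulbs on the columns in `cols`.

    A bulb (v, j) counts as lit when M[v][j] differs from x_v * y[j].
    """
    x = []
    for row in M:
        lit_if_plus = sum(row[j] != y[j] for j in cols)
        # Entries are +/-1, so the two switch positions are complementary on `cols`.
        lit_if_minus = len(cols) - lit_if_plus
        x.append(1 if lit_if_plus <= lit_if_minus else -1)
    return x

def transpose(M):
    return [list(col) for col in zip(*M)]

# Hypothetical planted instance: a cost-zero matrix x* y*^T with one bulb flipped.
x_star = [1, -1, -1, 1]
y_star = [1, 1, -1, -1]
M = [[xi * yj for yj in y_star] for xi in x_star]
M[0][3] = -M[0][3]

sample = [0, 1]                                   # indices revealed by the "nemesis"
x1 = greedy_rows(M, y_star, sample)               # first greedy step for the rows
y1 = greedy_rows(transpose(M), x_star, sample)    # the analogous step for the columns
x2 = greedy_rows(M, y1, list(range(len(M))))      # second greedy step against all of y1
```

In this toy run the first step for the columns gets one switch wrong because of a tie on the sample, yet the second greedy step for the rows still recovers the planted row switches, which is the behaviour the intuition above describes.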

Our techniques differ from previous work El [TTj in two key ways: 

1. Previous work used a sample size of $O((\log n)/\epsilon^2)$, which allowed the clearcut variables to be
set correctly after a single greedy step. We instead use a constant-sized sample and run a 
second greedy step before identifying the clearcut variables. 

2. Our algorithm is the first one that runs the additive error algorithm after identifying clearcut 
variables. Previous work ran the additive error algorithm at the beginning. 

The same ideas apply to all dense fragile CSPs. In the remainder of the paper we do not 
explicitly discuss the GB Game but present our ideas in the abstract framework of fragile-dense 
CSPs. 



2.2 Model 

We now give a formulation of MIN-$k$CSP that is suitable for our purposes. For non-negative
integers $n, k$, let $\binom{n}{k} = \frac{n!}{k!(n-k)!}$, and for a given set $V$ let $\binom{V}{k}$ denote the set of subsets of $V$ of size
$k$ (analogous to $2^S$ for all subsets of $S$). There is a set $V$ of $n$ variables, each of which can take any
value in a constant-sized domain $D$. Let $x_v \in D$ denote the value of variable $v$ in the assignment $x$.

Consider some $I \in \binom{V}{k}$. There may be many constraints over these variables; number them
arbitrarily. Define $p(I,\ell,x)$ to be 1 if the $\ell$th constraint over $I$ is unsatisfied in assignment $x$ and
zero otherwise. For $I \in \binom{V}{k}$ we define $p_I(x) = \frac{1}{\eta}\sum_{\ell} p(I,\ell,x)$, where $\eta$ is a scaling factor to ensure
$0 \le p_I(x) \le 1$ (e.g. $\eta = 2^k$ for MIN-$k$SAT). For notational simplicity we write $p_I$ as a function
of a complete assignment, but $p_I(x)$ only depends on $x_u$ for variables $u \in I$. For $I \notin \binom{V}{k}$ define
$p_I(x) = 0$.

Definition 4. On input $V, p$ a minimum constraint satisfaction problem (MIN-$k$CSP) is a problem
of finding an assignment $x$ minimizing $Obj(x) = \sum_{I \in \binom{V}{k}} p_I(x)$.

Let $R_{vi}(x)$ be an assignment over the variables $V$ that agrees with $x$ for all $u \in V$ except for
$v$ where it is $i$; i.e.

$$R_{vi}(x)_u = \begin{cases} i & \text{if } u = v \\ x_u & \text{otherwise.} \end{cases}$$

We will frequently use the identity $R_{v x_v}(x) = x$.
Let $b(x,v,i) = \sum_{I \in \binom{V}{k} : v \in I} p_I(R_{vi}(x))$ be the number of unsatisfied constraints $v$ would be in if $x_v$
were set to $i$ (divided by $\eta$).

We say the $\ell$th constraint over $I$ is fragile if $p(I,\ell,R_{vi}(x)) + p(I,\ell,R_{vj}(x)) \ge 1$ for all $v \in I$
and $i \ne j \in D$.

Definition 5. A MIN-$k$CSP is fragile-dense if $b(x,v,i) + b(x,v,j) \ge \delta\binom{n}{k-1}$ for some constant
$\delta > 0$ and for all assignments $x$, variables $v$ and distinct values $i$ and $j$.

Lemma 6. An instance where every variable $v \in V$ participates in at least $\delta\eta\binom{n}{k-1}$ fragile constraints
for some constant $\delta > 0$ is fragile-dense (with the same $\delta$).

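Proof. By definitions:

$$\begin{aligned}
b(x,v,i) + b(x,v,j) &= \sum_{I \in \binom{V}{k} : v \in I} \left(p_I(R_{vi}(x)) + p_I(R_{vj}(x))\right) \\
&\ge \sum_{I \in \binom{V}{k} : v \in I} \frac{1}{\eta} \cdot (\text{the number of fragile constraints over } I) \\
&\ge \frac{\delta\eta}{\eta}\binom{n}{k-1} = \delta\binom{n}{k-1}.
\end{aligned}$$

□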



We will make no further mention of individual constraints, $\eta$ or fragility; our algorithms and
analysis use $p_I$ and the fragile-dense property exclusively.

2.3 Algorithm 

We now describe our linear-time algorithms. The main ingredients of the algorithm are new iterative 
applications of additive error algorithms and a special greedy technique for refining random samples 
of constant size. 

Let $s = \frac{18\log(480|D|k/\delta)}{\delta^2}$ and let $S_1, S_2, \ldots, S_s$ be a multiset of independent random samples of
$k - 1$ variables from $V$. One can estimate $b(x^*,v,i)$ using the unbiased estimator
$\hat{b}(v,i) = \frac{\binom{n}{k-1}}{s}\sum_{j=1}^{s} p_{S_j \cup \{v\}}(R_{vi}(x^*))$ (see Lemma 13 for proof). One can determine the necessary values of $x^*$ by
exhaustively trying each possible combination.
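
The estimator can be sketched in a few lines of Python. This is a minimal illustration under assumptions: the cost function `p`, the instance and all names are hypothetical, assignments are dictionaries, and a cost $p_I$ is a callable taking a sorted tuple $I$ and an assignment. Feeding the estimator every $(k-1)$-subset exactly once reproduces its expectation, which makes the unbiasedness claim easy to check.

```python
from itertools import combinations
from math import comb

def set_value(x, v, i):
    """R_{vi}(x): copy of assignment x with variable v set to i."""
    y = dict(x)
    y[v] = i
    return y

def exact_b(V, k, p, x, v, i):
    """b(x, v, i): sum of p_I(R_{vi}(x)) over the k-subsets I that contain v."""
    y = set_value(x, v, i)
    others = [u for u in V if u != v]
    return sum(p(tuple(sorted(J + (v,))), y) for J in combinations(others, k - 1))

def estimate_b(V, k, p, x, v, i, samples):
    """Estimator of b(x, v, i) from a list of (k-1)-subsets S_j of V.

    Only the values of x on the sampled variables (and on v) are ever read.
    """
    y = set_value(x, v, i)
    total = 0
    for S in samples:
        if v in S:
            # S_j together with v has fewer than k elements, so p contributes 0.
            continue
        total += p(tuple(sorted(S + (v,))), y)
    return comb(len(V), k - 1) * total / len(samples)

def p(I, x):
    """Hypothetical 3-ary parity cost: 1 when the XOR over I misses a fixed target bit."""
    target = sum(I) % 2
    return int((x[I[0]] ^ x[I[1]] ^ x[I[2]]) != target)

V = list(range(6))
x = {0: 0, 1: 1, 2: 1, 3: 0, 4: 1, 5: 0}
all_samples = list(combinations(V, 2))   # every (k-1)-subset exactly once
```

With a genuine random sample the estimate fluctuates around `exact_b`; the Hoeffding argument in the analysis quantifies how large $s$ must be for the fluctuation to stay below $\frac{\delta}{6}\binom{n}{k-1}$.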
Algorithm 1 Our algorithm for dense-fragile MIN-$k$CSP

1: Run a $\frac{\epsilon}{1+\epsilon}\frac{\delta^2}{72k}\binom{n}{k}$ additive approximation algorithm.
2: if $Obj(\text{answer}) \ge \binom{n}{k}\delta^2/72k$ then
3: Return answer.
4: else
5: Let $s = \frac{18\log(480|D|k/\delta)}{\delta^2}$
6: Draw $S_1, S_2, \ldots, S_s$ randomly from $\binom{V}{k-1}$ with replacement.
7: for each assignment $\hat{x}^*$ of the variables in $\bigcup_{j=1}^{s} S_j$ do
8: For all $v$ and $i$ let $\hat{b}(v,i) = \frac{\binom{n}{k-1}}{s}\sum_{j=1}^{s} p_{S_j \cup \{v\}}(R_{vi}(\hat{x}^*))$
9: For all $v \in V$ let $x^{(1)}_v = \arg\min_i \hat{b}(v,i)$
10: For all $v \in V$ let $x^{(2)}_v = \arg\min_i b(x^{(1)},v,i)$
11: Let $C = \{v \in V : b(x^{(1)},v,x^{(2)}_v) < b(x^{(1)},v,j) - \delta\binom{n}{k-1}/6 \text{ for all } j \ne x^{(2)}_v\}$.

12: Find $x^{(3)}$ of cost at most $\frac{\epsilon\delta|V \setminus C|}{3k}\binom{n}{k-1} + \min[Obj(x)]$ using an additive approximation
algorithm, where the minimum ranges over $x$ such that $x_v = x^{(2)}_v\ \forall v \in C$.

13: end for
14: Return the best assignment $x^{(3)}$ found.
15: end if
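
Steps 10 and 11 of the algorithm amount to an argmin followed by a margin test. A minimal sketch, assuming the values $b(x^{(1)},v,i)$ have already been computed and stored in a nested dictionary (the function name, the data layout and the numbers in the usage note are hypothetical):

```python
from math import comb

def clear_cut(b_table, n, k, delta):
    """Return (x2, C) given b_table[v][i] = b(x1, v, i).

    x2[v] is the value minimizing b(x1, v, .); v joins C when that minimum beats
    every other value by more than delta * C(n, k-1) / 6.
    """
    margin = delta * comb(n, k - 1) / 6
    x2, C = {}, set()
    for v, costs in b_table.items():
        best = min(costs, key=costs.get)
        x2[v] = best
        if all(costs[best] < c - margin for j, c in costs.items() if j != best):
            C.add(v)
    return x2, C
```

For example, with `n = 12`, `k = 2` and `delta = 0.5` the margin is 1.0, so a variable with costs `{"a": 5, "b": 5.5}` is assigned `"a"` but stays out of `C`, while one with costs `{"a": 1, "b": 9}` is clear-cut.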



3 Analysis of Algorithm 1

We use one of the known additive error approximation algorithms for MAX-$k$CSP problems.

Theorem 7. [20] For any MAX-$k$CSP (or MIN-$k$CSP) and any $\epsilon' > 0$ there is a randomized
algorithm which returns an assignment of cost at most $OPT + \epsilon' n^k$ in runtime $O(n^k) + 2^{O(1/\epsilon'^2)}$.

Throughout the rest of the paper let x* denote an optimal assignment. 

First consider Algorithm 1 when the "then" branch of the "if" is taken. Choose constants
appropriately so that the additive error algorithm fails with probability at most 1/10 and assume
it succeeds. Let $x^a$ denote the additive-error solution. We know $Obj(x^a) \le Obj(x^*) + \frac{\epsilon}{1+\epsilon}P$
and $Obj(x^a) \ge P$ where $P = \binom{n}{k}\delta^2/72k$. Therefore $Obj(x^*) \ge P\left(1 - \frac{\epsilon}{1+\epsilon}\right) = \frac{P}{1+\epsilon}$ and hence
$Obj(x^a) \le Obj(x^*) + \frac{\epsilon}{1+\epsilon}(1+\epsilon)Obj(x^*) = (1+\epsilon)Obj(x^*)$. Therefore if the additive approximation
is returned it is a $(1+\epsilon)$-approximation.

The remainder of this section considers the case when Algorithm 1 takes the "else" branch.
Define $\gamma$ so that $Obj(x^*) = \gamma\binom{n}{k}$. We have $Obj(x^*) \le Obj(x^a) < \binom{n}{k}\delta^2/72k$ so $\gamma < \delta^2/72k$. We
analyze the $\hat{x}^*$ where we guess $x^*$, that is when $\hat{x}^*_v = x^*_v$ for all $v \in \bigcup_{j=1}^{s} S_j$. Clearly the overall cost
is at most the cost of $x^{(3)}$ during the iteration when we guess correctly.

Lemma 8. $b(x^*,v,x^*_v) \le b(x^*,v,j)$ for all $j \in D$.

Proof. Immediate from definition of $b$ and optimality of $x^*$. □
Lemma 9. For any assignment $x$,

$$Obj(x) = \frac{1}{k}\sum_{v \in V} b(x,v,x_v).$$

Proof. By definitions,

$$b(x,v,x_v) = \sum_{I \in \binom{V}{k} : v \in I} p_I(R_{v x_v}(x)) = \sum_{I \in \binom{V}{k} : v \in I} p_I(x).$$

Write $Obj(x) = \sum_{I \in \binom{V}{k}} p_I(x) = \sum_{I \in \binom{V}{k}} p_I(x)\left[\sum_{v \in I}\frac{1}{k}\right]$ and reorder summations. □
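
The double-counting identity $k \cdot Obj(x) = \sum_{v \in V} b(x,v,x_v)$ is easy to confirm numerically. The sketch below does so for $k = 2$ on a small instance; the GB-style cost function, the assignment and the names are illustrative assumptions only.

```python
from itertools import combinations

def objective(V, k, p, x):
    """Obj(x): sum of p_I(x) over all k-subsets I of V."""
    return sum(p(I, x) for I in combinations(V, k))

def b(V, k, p, x, v, i):
    """b(x, v, i): sum of p_I over the k-subsets containing v, with x_v replaced by i."""
    y = dict(x)
    y[v] = i
    others = [u for u in V if u != v]
    return sum(p(tuple(sorted(J + (v,))), y) for J in combinations(others, k - 1))

def p(I, x):
    """Hypothetical GB-style cost: 1 when x_u XOR x_v misses the target bit (u * v) mod 2."""
    u, v = I
    return int((x[u] ^ x[v]) != (u * v) % 2)

V = list(range(5))
x = {0: 1, 1: 0, 2: 1, 3: 1, 4: 0}
lhs = 2 * objective(V, 2, p, x)                    # k * Obj(x) with k = 2
rhs = sum(b(V, 2, p, x, v, x[v]) for v in V)       # each constraint is counted once per endpoint
```

Each $k$-subset is counted once for every one of its $k$ members on the right-hand side, which is exactly the reordering of summations used in the proof.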
Definition 10. We say variable $v$ in assignment $x$ is corrupted if $x_v \ne x^*_v$.

Definition 11. Variable $v$ is clear if $b(x^*,v,x^*_v) < b(x^*,v,j) - \frac{\delta}{3}\binom{n}{k-1}$ for all $j \ne x^*_v$. A variable is
unclear if it is not clear.

Clearness is the analysis analog of the algorithmic notion of clear-cut vertices sketched in Section 2.1.
Comparing the definition of clearness to Lemma 8 further motivates the terminology
"clear."

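Lemma 12. The number of unclear variables $t$ satisfies

$$t \le \frac{3(n-k+1)\gamma}{\delta} \le \frac{\delta n}{24k}.$$

Proof. Let $v$ be unclear and choose $j \ne x^*_v$ minimizing $b(x^*,v,j)$. By unclearness, $b(x^*,v,x^*_v) \ge
b(x^*,v,j) - \frac{1}{3}\delta\binom{n}{k-1}$. By fragile-dense, $b(x^*,v,x^*_v) + b(x^*,v,j) \ge \delta\binom{n}{k-1}$. Adding these inequalities
we see

$$b(x^*,v,x^*_v) \ge \frac{\delta}{3}\binom{n}{k-1}. \quad (1)$$

By Lemma 9 and (1),

$$\gamma\binom{n}{k} = Obj(x^*) = \frac{1}{k}\sum_{v \in V} b(x^*,v,x^*_v) \ge \frac{1}{k}\sum_{v : \text{unclear}} \frac{\delta}{3}\binom{n}{k-1} = \frac{t\delta}{3k}\binom{n}{k-1}.$$

Therefore $t \le \gamma\binom{n}{k}\frac{3k}{\delta\binom{n}{k-1}} = \frac{3\gamma}{\delta}(n-k+1)$.

For the second bound observe $3n\gamma/\delta < \frac{3n}{\delta} \cdot \frac{\delta^2}{72k} = \frac{\delta n}{24k}$. □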

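Lemma 13. The probability of a fixed clear variable $v$ being corrupted in $x^{(1)}$ is bounded above by
$\frac{\delta}{240k}$.

Proof. First we show that $\hat{b}(v,i)$ is in fact an unbiased estimator of $b(x^*,v,i)$ for all $i$. By definitions,
and in particular by the assumption that $p_I = 0$ when $|I| < k$, we have for any $1 \le j \le s$:

$$\begin{aligned}
E\left[p_{S_j \cup \{v\}}(R_{vi}(x^*))\right] &= \frac{1}{\binom{n}{k-1}}\sum_{J \in \binom{V}{k-1}} p_{J \cup \{v\}}(R_{vi}(x^*)) \\
&= \frac{1}{\binom{n}{k-1}}\sum_{I \in \binom{V}{k} : v \in I} p_I(R_{vi}(x^*)) \\
&= \frac{1}{\binom{n}{k-1}}\, b(x^*,v,i).
\end{aligned}$$

Therefore $E[\hat{b}(v,i)] = \frac{\binom{n}{k-1}}{s} \cdot s \cdot E\left[p_{S_1 \cup \{v\}}(R_{vi}(x^*))\right] = b(x^*,v,i)$.

Recall that $0 \le p_I(x) \le 1$ by definition of $p$, so by Azuma-Hoeffding,

$$\Pr\left[\left|\sum_{j=1}^{s} p_{S_j \cup \{v\}}(R_{vi}(x^*)) - \frac{s}{\binom{n}{k-1}}\, b(x^*,v,i)\right| \ge \lambda s\right] \le 2e^{-2\lambda^2 s},$$

hence

$$\Pr\left[\left|\hat{b}(v,i) - b(x^*,v,i)\right| \ge \lambda\binom{n}{k-1}\right] \le 2e^{-2\lambda^2 s}.$$

Choose $\lambda = \delta/6$ and recall $s = \frac{18\log(480|D|k/\delta)}{\delta^2}$, yielding

$$\Pr\left[\left|\hat{b}(v,i) - b(x^*,v,i)\right| \ge \frac{\delta}{6}\binom{n}{k-1}\right] \le \frac{\delta}{240|D|k}.$$

By clearness we have $b(x^*,v,j) > b(x^*,v,x^*_v) + \delta\binom{n}{k-1}/3$ for all $j \ne x^*_v$. Therefore, the probability
that $\hat{b}(v,x^*_v)$ is not the smallest $\hat{b}(v,j)$ is bounded by $|D|$ times the probability that a particular
$\hat{b}(v,j)$ differs from its mean by at least $\delta\binom{n}{k-1}/6$. Therefore $\Pr\left[x^{(1)}_v \ne x^*_v\right] \le |D| \cdot \frac{\delta}{240|D|k} = \frac{\delta}{240k}$. □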
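
The choice of $s$ is tuned so that the Hoeffding tail $2e^{-2\lambda^2 s}$ with $\lambda = \delta/6$ collapses to $\frac{\delta}{240|D|k}$. The short numerical check below assumes that $\log$ denotes the natural logarithm and ignores rounding $s$ to an integer; the parameter values and function names are illustrative only.

```python
from math import exp, log, isclose

def sample_size(delta, k, domain_size):
    """s = 18 log(480 |D| k / delta) / delta^2, with log taken as the natural logarithm."""
    return 18 * log(480 * domain_size * k / delta) / delta ** 2

def hoeffding_tail(delta, k, domain_size):
    """2 exp(-2 lambda^2 s) with lambda = delta / 6."""
    lam = delta / 6
    return 2 * exp(-2 * lam ** 2 * sample_size(delta, k, domain_size))

delta, k, domain_size = 0.5, 2, 2          # illustrative parameter values
tail = hoeffding_tail(delta, k, domain_size)
target = delta / (240 * domain_size * k)   # the bound for a single estimate
```

Multiplying the single-estimate bound by $|D|$ (a union bound over the values $j$) gives the $\frac{\delta}{240k}$ of the lemma.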

Let $E_1$ denote the event that the assignment $x^{(1)}$ has at most $\delta n/12k$ corrupted variables.

Lemma 14. Event $E_1$ occurs with probability at least $1 - 1/10$.

Proof. We consider the corrupted clear and unclear variables separately. By Lemma 12, the number
of unclear variables, and hence the number of corrupted unclear variables, is bounded by $\frac{\delta n}{24k}$.

The expected number of clear corrupted variables can be bounded by $\frac{\delta n}{240k}$ using Lemma 13, so
by the Markov bound the number of clear corrupted variables is less than $\frac{\delta n}{24k}$ with probability at least
$1 - 1/10$.

Therefore the total number of corrupted variables is bounded by $\frac{\delta n}{24k} + \frac{\delta n}{24k} = \frac{\delta n}{12k}$ with probability
at least 9/10. □

We henceforth assume $E_1$ occurs. The remainder of the analysis is deterministic.

Lemma 15. For assignments $y$ and $y'$ that differ in the assignment of at most $t$ variables, for all
variables $v$ and values $i$, $|b(y,v,i) - b(y',v,i)| \le t\binom{n}{k-2}$.

Proof. Clearly $p_I(R_{vi}(y))$ is a function only of the variables in $I$ excluding $v$, so if $I \setminus \{v\}$ consists
of variables $u$ where $y_u = y'_u$, then $p_I(R_{vi}(y)) - p_I(R_{vi}(y')) = 0$. Therefore $b(y,v,i) - b(y',v,i)$
equals the sum, over $I \in \binom{V}{k}$ containing $v$ and at least one variable $u$ other than $v$ where $y_u \ne y'_u$,
of $[p_I(R_{vi}(y)) - p_I(R_{vi}(y'))]$. For any $I$, $|p_I(R_{vi}(y)) - p_I(R_{vi}(y'))| \le 1$, so by the triangle inequality
a bound on the number of such sets suffices to bound $|b(y,v,i) - b(y',v,i)|$. The number of such
sets can trivially be bounded above by $t\binom{n}{k-2}$. □

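Lemma 16. Let $C = \{v \in V : b(x^{(1)},v,x^{(2)}_v) < b(x^{(1)},v,j) - \delta\binom{n}{k-1}/6 \text{ for all } j \ne x^{(2)}_v\}$ as defined
in Algorithm 1. If $E_1$ then:

• $x^{(2)}_v = x^*_v$ for all $v \in C$.

• $|V \setminus C| \le \frac{3(n-k+1)\gamma}{\delta}$.

Proof. Assume $E_1$ occurred. From the definition of corrupted, event $E_1$ and Lemma 15, for sufficiently
large $n$ (so that $\frac{n-k+2}{k-1} \ge \frac{n}{k}$), for any $v, i$:

$$\left|b(x^{(1)},v,i) - b(x^*,v,i)\right| \le \frac{\delta n}{12k}\binom{n}{k-2} \le \frac{\delta}{12}\binom{n}{k-1}. \quad (2)$$

For the first, if $v \in C$ then using (2), for any $j \ne x^{(2)}_v$,

$$\begin{aligned}
b(x^*,v,x^{(2)}_v) &\le b(x^{(1)},v,x^{(2)}_v) + \frac{\delta}{12}\binom{n}{k-1} \\
&< b(x^{(1)},v,j) - \frac{\delta}{6}\binom{n}{k-1} + \frac{\delta}{12}\binom{n}{k-1} \\
&\le b(x^*,v,j) + \left(-\frac{1}{6} + \frac{2}{12}\right)\delta\binom{n}{k-1} = b(x^*,v,j).
\end{aligned}$$

So by Lemma 8, $x^*_v = x^{(2)}_v$.

For any $u$ that is clear, using (2) again, for any $j \ne x^*_u$:

$$\begin{aligned}
b(x^{(1)},u,x^*_u) &\le b(x^*,u,x^*_u) + \frac{\delta}{12}\binom{n}{k-1} \\
&< b(x^*,u,j) - \frac{\delta}{3}\binom{n}{k-1} + \frac{\delta}{12}\binom{n}{k-1} \\
&\le b(x^{(1)},u,j) + \left(-\frac{1}{3} + \frac{2}{12}\right)\delta\binom{n}{k-1} = b(x^{(1)},u,j) - \frac{\delta}{6}\binom{n}{k-1},
\end{aligned}$$

so by definition of $C$, $u \in C$. Therefore the conclusion follows from Lemma 12. □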

Now we give the details of the computation of $x^{(3)}$. Let $T = V \setminus C$. We call $C$ the clear-cut
vertices and $T$ the tricky vertices. We assume that $|T| \ge k$; if not simply consider every possible
assignment to the variables in $T$. With the variables in $C$ fixed, those variables can be substituted
into the $p_I$ and eliminated. To restore a uniform arity of $k$ we pad the $p_I$ of arity less than $k$
with irrelevant variables from $T$. To ensure none of the resulting $p_I$ has excessive weight we use a
uniform mixture of all possibilities for the padding vertices.

If $y$ is an assignment to the variables in $T$ let

$$R_{Ty}(x)_v = \begin{cases} y_v & \text{if } v \in T \\ x_v & \text{otherwise,} \end{cases}$$

a natural generalization of the $R_{vi}(x)$ notation. For $K \in \binom{T}{k}$ and $y \in D^{|T|}$ define

$$q_K(y) = \sum_{j=1}^{k}\sum_{J \in \binom{K}{j}}\sum_{L \in \binom{C}{k-j}} p_{J \cup L}\left(R_{Ty}(x^{(2)})\right)\binom{|T|-j}{k-j}^{-1}.$$

It is easy to see that $q_K(y)$ is a function only of $y_v$ for $v \in K$ and is hence a cost function analogous
to $p_I$ (though not properly normalized).



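Lemma 17. For any $y \in D^{|T|}$ we have

$$Obj\left(R_{Ty}(x^{(2)})\right) = \sum_{K \in \binom{T}{k}} q_K(y) + \sum_{I \in \binom{C}{k}} p_I(x^{(2)}).$$

Proof. Let $x = R_{Ty}(x^{(2)})$. By definition

$$\sum_{K \in \binom{T}{k}} q_K(y) = \sum_{K \in \binom{T}{k}}\sum_{j=1}^{k}\sum_{J \in \binom{K}{j}}\sum_{L \in \binom{C}{k-j}} p_{J \cup L}(x)\binom{|T|-j}{k-j}^{-1}. \quad (3)$$

Compare to

$$Obj(x) - \sum_{I \in \binom{C}{k}} p_I(x^{(2)}) = \sum_{I \in \binom{V}{k} : I \not\subseteq C} p_I(x). \quad (4)$$

Fix $I \in \binom{V}{k}$ and study the weight of $p_I(x)$ in the right-hand sides of (3) and (4). Note there are
unique $j \ge 0$, $J \in \binom{T}{j}$ and $L \in \binom{C}{k-j}$ such that $I = J \cup L$. If $j = 0$ then $p_I(x)$ has weight 0 in (3)
and in (4). If $j \ge 1$ then $p_I(x)$ appears once in (3) for each $K \in \binom{T}{k}$ such that $K \supseteq J$. There are
$\binom{|T|-j}{k-j}$ of those and each has weight $\binom{|T|-j}{k-j}^{-1}$ so $p_I(x)$ has an overall weight of 1 in (3). Clearly
$j \ge 1$ implies $I \not\subseteq C$ hence the weight of $p_I(x)$ in (4) is 1 as well. □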

Lemma 18.

$$0 \le q_K(y) \le O\left(\left(\frac{|C|}{|T|}\right)^{k-1}\right)$$

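Proof. Recalling that $0 \le p_I(y) \le 1$ and $k = O(1)$, and using $|C| \ge |T| \ge k$:

$$q_K(y) \le \sum_{j=1}^{k}\binom{k}{j}\binom{|C|}{k-j}\binom{|T|-j}{k-j}^{-1} = O\left(\left(\frac{|C|}{|T|}\right)^{k-1}\right).$$

□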

Lemma 18 and Theorem 7 with an error parameter of $\epsilon' = \Theta(\epsilon)$ yield an additive error
of $O(\epsilon|T|^k(|C|/|T|)^{k-1}) = O(\epsilon(|T|/|C|)n^k)$ for the problem of minimizing $\sum_{K \in \binom{T}{k}} q_K(y)$. Using
Lemma 16 we further bound the additive error $O(\epsilon(|T|/|C|)n^k)$ by $O(\epsilon\gamma n^k)$. By Lemma 17 this
is also an additive error $O(\epsilon\gamma n^k)$ for $Obj(R_{Ty}(x^{(2)}))$. Lemma 16 implies that $x^* = R_{Ty}(x^{(2)})$ for
some $y$, so this yields an additive error $O(\epsilon\gamma n^k) = O(\epsilon \cdot OPT)$ for our original problem of minimizing
$Obj(x)$ over all assignments $x$.



4 Correlation Clustering and Hierarchical Clustering Algorithm 
4.1 Intuition 

As we noted previously in Section 1, correlation clustering constraints are not fragile for $d > 2$.
Indeed, the constraint corresponding to a pair of vertices that are not connected by an edge
can be satisfied by any coloring of the endpoints as long as the endpoints are colored differently.
Fortunately there is a key observation in [17] that allows for the construction of a PTAS. Consider
the cost-zero clustering shown on the left of Figure 2. Note that moving a vertex from a small
cluster to another small one increases the cost very little, but moving a vertex from a large cluster
to anywhere else increases the cost a lot. Fortunately most vertices are in big clusters so, as in
[17], we can postpone processing the vertices in small clusters. We use the above ideas, which
are due to [17], the fragile-dense ideas sketched above, plus some additional ideas, to analyze our
correlation clustering algorithm.

To handle hierarchical clustering (cf. [1]) we need a few more ideas. Firstly we abstract the
arguments of the previous paragraph to a CSP property rigidity. Secondly, we note that the number 
of trees with d leaves is a constant and therefore we can safely try them all. We remark that all 
fragile-dense problems are also rigid. 
Figure 2: An illustration of correlation clustering and the rigidity property.
4.2 Reduction to Rigid MIN-2CSP 

We now define hierarchical clustering formally (following [1]). For integer $M \ge 1$, an $M$-level
hierarchical clustering of $n$ objects $V$ is a rooted tree with the elements of $V$ as the leaves and
every leaf at depth (distance to root) exactly $M + 1$. For $M = 1$, a hierarchical clustering has
one node at the root, some "cluster" nodes in the middle level and all of $V$ in the bottom level.
The nodes in the middle level can be identified with clusters of $V$. We call the subtree induced by
the internal nodes of an $M$-level hierarchical clustering the trunk. We call the leaves of the trunk
clusters. A hierarchical clustering is completely specified by its trunk and the parent cluster of each
leaf.

For a fixed hierarchical clustering and clusters $i$ and $j$, let $f(i,j)$ be the distance from $i$ (or $j$)
to the lowest common ancestor of $i$ and $j$. For example when $M = 1$, $f(i,j) = \mathbb{1}(i \ne j)$.

We are given a function $F$ from pairs of vertices to $\{0, 1, \ldots, M\}$.¹ The objective of hierarchical
clustering is to output an $M$-level hierarchical clustering minimizing $\sum_{u,v} \frac{1}{M}|F(u,v) -
f(\text{parent}(u), \text{parent}(v))|$. Hierarchical clustering with $d$ clusters is the same except that we restrict
the number of clusters (recall that this equals the number of nodes whose children are leaves) to at most $d$.
The special case of hierarchical clustering with $M = 1$ is also called correlation clustering.
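
As an illustration of the objective, the sketch below computes $f$ from a trunk given as parent pointers and sums $|F(u,v) - f(\cdot,\cdot)|$ over unordered pairs. The trunk, the data and the names are hypothetical; the sum is left unnormalized, so any scaling (such as the $1/M$ factor used in the proof of Lemma 21) is up to the caller, and counting unordered rather than ordered pairs is an assumption of this sketch.

```python
def lca_distance(parent, i, j):
    """f(i, j): number of steps from cluster i (or j) up to their lowest common ancestor.

    Clusters all sit at the same depth of the trunk, so both can climb in lockstep.
    """
    steps = 0
    while i != j:
        i, j = parent[i], parent[j]
        steps += 1
    return steps

def disagreement(F, cluster_of, parent):
    """Sum over unordered pairs {u, v} of |F(u, v) - f(cluster(u), cluster(v))|."""
    verts = sorted(cluster_of)
    total = 0
    for a, u in enumerate(verts):
        for v in verts[a + 1:]:
            total += abs(F[(u, v)] - lca_distance(parent, cluster_of[u], cluster_of[v]))
    return total

# Hypothetical 2-level trunk: root r, internal nodes p and q,
# clusters A and B (children of p) and C (child of q).
parent = {"A": "p", "B": "p", "C": "q", "p": "r", "q": "r"}
cluster_of = {0: "A", 1: "A", 2: "B", 3: "C"}
F = {(0, 1): 0, (0, 2): 1, (0, 3): 2, (1, 2): 2, (1, 3): 2, (2, 3): 1}
cost = disagreement(F, cluster_of, parent)
```

With a one-level trunk (every cluster a child of the root) `lca_distance` reduces to the indicator of $i \ne j$, and the sum counts cut pairs with $F = 0$ plus uncut pairs with $F = 1$, i.e. the correlation clustering cost.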

Lemma 19. The number of possible trunks is at most $d^{(M-1)d}$.

Proof. The trunk can be specified by giving the parent of all non-root nodes. There are at most d 
nodes on each of the M — 1 non-root levels so the lemma follows. □ 

We now show how to reduce hierarchical clustering with a constant number of clusters to the
solution of a constant number of MIN-2CSPs. We use notation similar to, but not identical to,
the notation used in Sections 2 and 3. For vertices $u, v$ and values $i, j$, let $p_{u,v}(i,j)$ be the cost of
putting $u$ in cluster $i$ and $v$ in cluster $j$. This is the same concept as $p_I$ for the fragile case, but
this notation is more convenient here. Define $b(x,v,i) = \sum_{u \in V, u \ne v} p_{u,v}(x_u, i)$, which is identical to
$b$ of the fragile-dense analysis but expressed using different notation.



¹ [1] chose $\{1, 2, \ldots, M + 1\}$ instead; the difference is merely notational.
Definition 20. A MIN-2CSP is rigid if for some $\delta > 0$, all $v \in V$ and all $j \ne x^*_v$

$$b(x^*,v,x^*_v) + b(x^*,v,j) \ge \delta\,|\{u \in V : x^*_u = x^*_v\}|.$$

Observe that $|\{u \in V : x^*_u = x^*_v\}| \le |V| = \binom{n}{1}$, hence any fragile-dense CSP is also rigid.

Lemma 21. If the trunk is fixed, hierarchical clustering can be expressed as a $1/M$-rigid MIN-2CSP
with $|D| = d$.

Proof. (Cf. Figure 2.) Choose $\delta = 1/M$. Let $D$ be the leaves of the trunk (clusters). It is easy to
see that choosing

$$p_{u,v}(i,j) = \frac{1}{M}|F(u,v) - f(i,j)|$$

yields the correct objective function. To show rigidity, fix vertex $v$, define $i = x^*_v$ and $C_i = \{u \in
V : x^*_u = i\}$. Fix $j \ne i$ and $u \in C_i \setminus \{v\}$. Clearly $|f(i,i) - f(i,j)| \ge 1$, hence by the triangle inequality
$|F(u,v) - f(i,i)| + |F(u,v) - f(i,j)| \ge 1$, hence $p_{u,v}(i,i) + p_{u,v}(i,j) \ge 1/M$. Summing over $u \in C_i$
we see

$$b(x^*,v,x^*_v) + b(x^*,v,j) \ge \frac{1}{M}|C_i \setminus \{v\}| \approx \frac{1}{M}|C_i| = \delta\,|\{u \in V : x^*_u = x^*_v\}|.$$

Sweeping the "$\approx$" under the rug this proves the Lemma.² □

Lemmas 21 and 19 suggest a technique for solving hierarchical clustering: guess the trunk and
then solve the rigid MIN-2CSP. We now give our algorithm for solving rigid MIN-2CSPs. 

4.3 Algorithm for Rigid MIN-2CSP 

Algorithm 2 solves rigid MIN-2CSPs by identifying clear-cut variables, fixing their value, and then
recursing on the remaining "tricky" variables T. The recursion terminates when the remaining 
subproblem is sufficiently expensive for an additive approximation to suffice. 

5 Analysis of Algorithm 2
5.1 Runtime 

Theorem 22. For any $T, y$, an assignment of cost at most $\epsilon'|T|^2 + \min_{x : x_v = y_v \forall v \in V \setminus T}[Obj(x)]$ can
be found in time $n^2 2^{O(1/\epsilon'^2)}$.

Proof. The problem is essentially a CSP on T vertices but with an additional linear cost term 
for each vertex. It is fairly easy to see that Algorithm 1 from Mathieu and Schudy [20] has error 
proportional to the misestimation of b and hence is unaffected by arbitrarily large linear cost terms. 
On the other hand, the more efficient Algorithm 2 from [20] needs to estimate the objective value 
from a constant-sized sample as well and hence does not seem to work for this type of problem. □ 

In this subsection $O(\cdot)$ hides only absolute constants. Algorithm 2 has recursion depth at
most $|D| + 1$ and branching factor $|D|^s$, so the number of recursive calls is at most $(|D|^s)^{|D|+1} =
2^{s(|D|+1)\log|D|} = 2^{\tilde{O}(|D|^5/\delta^4)}$. Each call spends $O(|D|n^2)$ time on miscellaneous tasks such as computing
the objective value plus time required to run the additive error algorithm, which is $n^2 2^{O(|D|^6/(\epsilon^2\delta^6))}$



² There are inelegant ways to remove this approximation. For example, assume that all d clusters of $x^*$ are
non-empty and consider one vertex from $C_j$ as well.
Algorithm 2 Approximation Algorithm for Rigid MIN-2CSPs.

Return CC(V, blank assignment, 0)

CC(tricky vertices T, assignment y of V \ T, recursion depth depth):

 1: Find assignment of cost at most $\frac{\epsilon}{1+\epsilon} \cdot \frac{\delta^3 |T|^2}{6 \cdot 72^2 |D|^3} + \min_{x : x_v = y_v \, \forall v \in V \setminus T} [Obj(x)]$ using an additive
    approximation algorithm.
 2: if $Obj(answer) \ge \frac{\delta^3 |T|^2}{6 \cdot 72^2 |D|^3}$ or $depth \ge |D| + 1$ then
 3:   Return answer.
 4: else
 5:   Let $s = \frac{432^2 |D|^4 \log(1440 |D|^3/\delta)}{\delta^4}$
 6:   Draw $v_1, v_2, \ldots, v_s$ randomly from $T$ with replacement.
 7:   for each assignment $\hat{x}$ of the variables $\{v_1, v_2, \ldots, v_s\}$ do
 8:     For all $v \in T$ and $i$ let $\hat{b}(v,i) = \frac{|T|}{s} \sum_{j=1}^{s} p_{v_j,v}(\hat{x}_{v_j}, i) + \sum_{u \in V \setminus T} p_{u,v}(y_u, i)$
 9:     For all $v \in V$ let $x^{(1)}_v = y_v$ if $v \in V \setminus T$, and $x^{(1)}_v = \arg\min_i \hat{b}(v,i)$ otherwise
10:     For all $v \in T$ let $x^{(2)}_v = \arg\min_i b(x^{(1)}, v, i)$
11:     Let $C = \{v \in T : b(x^{(1)}, v, x^{(2)}_v) < b(x^{(1)}, v, j) - \frac{\delta |T|}{12 |D|}$ for all $j \ne x^{(2)}_v\}$.
12:     Let $T' = T \setminus C$
13:     Define assignment $y'$ by $y'_v = y_v$ if $v \in V \setminus T$; $y'_v = x^{(2)}_v$ if $v \in C$; $y'_v$ undefined if $v \in T \setminus C$
14:     If CC($T'$, $y'$, depth + 1) is the best clustering so far, update best.
15:   end for
16:   Return the best clustering found.
17: end if



by Theorem [22]. Therefore the runtime of Algorithm [2] is $n^2 2^{O(|D|^6/(\epsilon^2\delta^6))}$, where the $2^{\tilde{O}(|D|^5/\delta^4)}$ from the
size of the recursion tree got absorbed into the $2^{O(|D|^6/(\epsilon^2\delta^6))}$ from Theorem [22]. For hierarchical clustering,
$\delta = 1/M$ yields a runtime of $n^2 2^{O(|D|^6 M^6/\epsilon^2)} \cdot |D|^{(M-1)|D|} = n^2 2^{O(|D|^6 M^6/\epsilon^2)}$.

As noted in the introduction this improves on the runtime of $n^{O(9^{|D|}/\epsilon^2)}$ of [12] for correlation
clustering in two ways: the degree of the polynomial is independent of $\epsilon$ and $|D|$, and the dependence
on $|D|$ is singly rather than doubly exponential.
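
The size of the recursion tree can be spelled out as a short calculation. The following is only a sketch: it substitutes the sample size $s$ from Algorithm [2], uses $|D| + 1 \le 2|D|$, and assumes $\epsilon, \delta \le 1$ for the final absorption remark.

```latex
\begin{align*}
\log_2\bigl((|D|^{s})^{|D|+1}\bigr)
  &= s\,(|D|+1)\,\log_2|D| \\
  &= \frac{432^2\,|D|^4 \log(1440|D|^3/\delta)}{\delta^4}\,(|D|+1)\,\log_2|D| \\
  &\le \frac{2\cdot 432^2\,|D|^5\,\log_2|D|\,\log(1440|D|^3/\delta)}{\delta^4}.
\end{align*}
% Up to logarithmic factors this is |D|^5/\delta^4. Since
% \delta^2 \log(1/\delta) and \log^2|D| / |D| are bounded by absolute constants,
% the whole expression is O(|D|^6/(\epsilon^2\delta^6)) when \epsilon, \delta \le 1.
```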

5.2 Approximation 

We fix an optimal assignment $x^*$. We analyze the path through the recursion tree where we always
guess $x^*$ correctly, i.e. $\hat{x}_v = x^*_v$ for all $v \in \{v_1, v_2, \ldots, v_s\}$. We call this the principal path.
We will need the following definitions. 

Definition 23. Vertex $v$ is $m$-clear if $b(x^*,v,x^*_v) < b(x^*,v,j) - m$ for all $j \ne x^*_v$. We say a vertex
is clear if it is $m$-clear for $m$ obvious from context. A vertex is unclear if it is not clear.

Definition 24. A vertex is obvious if it is in cluster $C$ in OPT and it is $\delta|C|/3$-clear.

Definition 25. A cluster $C$ of OPT is finished w.r.t. $T$ if $T \cap C$ contains no obvious vertices.
Lemma 26. With probability at least 8/10, for any $(T, y, depth)$ encountered on the principal path,

1. $y_v = x^*_v$ for all $v \in V \setminus T$ and

2. The number of finished clusters w.r.t. $T$ is at least $depth$.

Before proving Lemma [26] let us see why it implies Algorithm [2] has the correct approximation
factor.

Proof. Study the final call on the principal path, which returns the additive approximation clustering.
The second part of Lemma [26] implies that $depth \le |D|$, hence we must have terminated
because $Obj(answer) \ge \frac{\delta^3 |T|^2}{6 \cdot 72^2 |D|^3}$. By the first part of Lemma [26] the additive approximation gives
cost at most

$$\frac{\epsilon}{1+\epsilon} \cdot \frac{\delta^3 |T|^2}{6 \cdot 72^2 |D|^3} + OPT$$

so the approximation factor follows from an easy calculation. □
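
For completeness, the calculation can be written out as below. This is a sketch in which $\tau$ is shorthand introduced here (not in the original) for the termination threshold $\delta^3|T|^2/(6\cdot 72^2|D|^3)$, so that the termination condition reads $\tau \le Obj(answer)$.

```latex
\begin{align*}
Obj(answer) &\le \frac{\epsilon}{1+\epsilon}\,\tau + OPT
             \le \frac{\epsilon}{1+\epsilon}\,Obj(answer) + OPT, \\
\text{hence}\quad \frac{1}{1+\epsilon}\,Obj(answer) &\le OPT,
\quad\text{i.e.}\quad Obj(answer) \le (1+\epsilon)\,OPT .
\end{align*}
```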

Now we prove Lemma [26] by induction. Our base case is the root, which vacuously satisfies the
inductive hypothesis since V \ T = {} and depth = 0. We show that if a node (T, y, depth) (in the 
recursion tree) satisfies the invariant then its child (T',y', depth + 1) does as well. We hereafter 
analyze a particular (T, y, depth) and assume the inductive hypothesis holds for them. There is 
only something to prove if a child exists, so we hereafter assume the additive error answer is not 
returned from this node. We now prove a number of Lemmas in this context, from which the fact 
that T',y', depth + 1 satisfies the inductive hypothesis will trivially follow. 

Lemma 27. The number of $\delta^2|T|/(216|D|^2)$-clear variables that are corrupted in $x^{(1)}$ is at most
$\delta|T|/(72|D|)$ with probability at least $1 - 1/(10|D|)$.

Proof. Essentially the same proof as for fragile MIN-$k$CSP, together with the recursion invariant, shows $\hat{b}(v,i)$
is an unbiased estimator of $b(x^*,v,i)$.
This time Azuma-Hoeffding yields

$$\Pr\left[\,|\hat{b}(v,i) - b(x^*,v,i)| \ge \lambda|T|\,\right] \le 2e^{-2\lambda^2 s}.$$

Choose $\lambda = \frac{\delta^2}{432|D|^2}$ and recall $s = \frac{432^2 |D|^4 \log(1440|D|^3/\delta)}{\delta^4}$, yielding

$$\Pr\left[\,|\hat{b}(v,i) - b(x^*,v,i)| \ge \frac{\delta^2|T|}{432|D|^2}\,\right] \le \frac{\delta}{720|D|^3}.$$

By clearness we have $b(x^*,v,j) > b(x^*,v,x^*_v) + \delta^2|T|/(216|D|^2)$ for all $j \ne x^*_v$. Therefore, the
probability that $\hat{b}(v,x^*_v)$ is not the smallest $\hat{b}(v,j)$ is bounded by $|D|$ times the probability that
a particular $\hat{b}(v,j)$ differs from its mean by at least $\delta^2|T|/(432|D|^2)$. Therefore $\Pr[x^{(1)}_v \ne x^*_v] \le
|D| \cdot \frac{\delta}{720|D|^3} = \frac{\delta}{720|D|^2}$. Therefore, by the Markov bound, with probability $1 - 1/(10|D|)$ the number of
corrupted $\delta^2|T|/(216|D|^2)$-clear variables is at most $\delta|T|/(72|D|)$. □
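
The constants in this proof can be checked by direct substitution. The sketch below assumes the logarithm in $s$ is the natural logarithm and writes $a = \delta/(1440|D|^3) \le 1$ as shorthand introduced here.

```latex
\begin{align*}
2\lambda^2 s &= 2\cdot\frac{\delta^4}{432^2|D|^4}\cdot
                \frac{432^2|D|^4\log(1440|D|^3/\delta)}{\delta^4}
              = 2\log\frac{1440|D|^3}{\delta}, \\
2e^{-2\lambda^2 s} &= 2a^2 \le 2a = \frac{\delta}{720|D|^3}, \\
\mathbb{E}\bigl[\#\{\text{corrupted clear } v \in T\}\bigr]
  &\le |T|\cdot\frac{\delta}{720|D|^2}, \\
\Pr\Bigl[\#\{\text{corrupted clear } v \in T\} > \frac{\delta|T|}{72|D|}\Bigr]
  &\le \frac{\delta|T|/(720|D|^2)}{\delta|T|/(72|D|)} = \frac{1}{10|D|}.
\end{align*}
```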

There are two types of bad events: the additive error algorithm failing and our own random 
samples failing. We choose constants so that each of these events has probability at most $1/(10|D|)$.
This path has length at most $|D|$, so the overall probability of a bad event is at most 2/10. We
hereafter assume no bad events occur. 
Lemma 28. The number of $\delta c/3$-unclear variables in clusters of size at least $c$ is at most $\frac{6 \cdot OPT}{\delta c}$.

Proof. Let confusing variable refer to a $\delta c/3$-unclear variable in a cluster of size at least $c$. Let $v$ be
such a variable, in cluster $C$ in OPT. By unclearness,

$$b(x^*,v,x^*_v) \ge b(x^*,v,j) - \delta c/3$$

for appropriate $j \ne x^*_v$ and by rigidity

$$b(x^*,v,x^*_v) + b(x^*,v,j) \ge \delta|C|.$$

Adding these inequalities we see $b(x^*,v,x^*_v) \ge \delta c/3$.

$OPT = \frac{1}{2}\sum_v b(x^*,v,x^*_v) \ge \frac{1}{2}\sum_{v \text{ confusing}} \delta c/3 = |\{v \in T : v \text{ confusing}\}| \, \delta c/6$, so
$|\{v \in T : v \text{ confusing}\}| \le \frac{6 \cdot OPT}{\delta c}$. □
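
The step "adding these inequalities" uses $|C| \ge c$; written out, as a sketch of the omitted arithmetic:

```latex
\begin{align*}
2\,b(x^*,v,x^*_v)
  &\ge \bigl(b(x^*,v,j) - \delta c/3\bigr) + \bigl(\delta|C| - b(x^*,v,j)\bigr) \\
  &= \delta|C| - \delta c/3 \;\ge\; \delta c - \delta c/3 \;=\; 2\delta c/3,
\end{align*}
% so b(x^*, v, x^*_v) \ge \delta c / 3.
```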

Lemma 29. For all $v, i$, $|b(x^{(1)},v,i) - b(x^*,v,i)| \le \frac{\delta|T|}{24|D|}$.

Proof. First we show bounds on three classes of corrupted variables:

1. The number of $\delta^2|T|/(216|D|^2)$-clear corrupted vertices is bounded by $\delta|T|/(72|D|)$ using Lemma [27].

2. The number of vertices in clusters of size at most $\delta|T|/(72|D|^2)$ is bounded by $\delta|T|/(72|D|)$.

3. The number of $\delta^2|T|/(216|D|^2)$-unclear corrupted vertices in clusters of size at least $\delta|T|/(72|D|^2)$
is bounded by, using Lemma [28], $\frac{6 \cdot 72|D|^2}{\delta^2|T|} OPT \le \frac{6 \cdot 72|D|^2}{\delta^2|T|} \cdot \frac{\delta^3|T|^2}{6 \cdot 72^2|D|^3} = \frac{\delta|T|}{72|D|}$.

Therefore the total number of corrupted variables in $x^{(1)}$ is at most $\frac{\delta|T|}{72|D|} + \frac{\delta|T|}{72|D|} + \frac{\delta|T|}{72|D|} = \frac{\delta|T|}{24|D|}$.
The easy observation that $|b(x^{(1)},v,i) - b(x^*,v,i)|$ is bounded by the number of corrupted variables
in $x^{(1)}$ proves the Lemma. □

Lemma 30. There exists an obvious vertex in $T$ that is in a cluster of size at least $|T|/(2|D|)$.

Proof. Simple counting shows there are at most $|T|/2$ vertices of $T$ in clusters of size less than
$|T|/(2|D|)$.

We say a vertex $v$ is confusing' if it is non-obvious and its cluster in OPT has size at least
$|T|/(2|D|)$. By Lemma [28],

$$|\{v \in T : v \text{ confusing}'\}| \le \frac{12|D|}{\delta|T|} OPT < \frac{12|D|}{\delta|T|} \cdot \frac{\delta^3|T|^2}{6 \cdot 72^2|D|^3} < |T|/2.$$

Therefore by counting there must be an obvious vertex in a big cluster of OPT. □

Lemma 31. The number of finished clusters w.r.t. $T'$ strictly exceeds the number of finished
clusters w.r.t. $T$.

Proof. Let $v$ be the vertex promised by Lemma [30] and $C_i$ its cluster in OPT. For any obvious
vertex $u$ in $C_i$ note that $u$ is $\delta|C_i|/3 \ge \delta|T|/(6|D|)$-clear, so Lemma [29] implies, for any $j \ne i$,

$$b(x^{(1)},u,i) \le b(x^*,u,i) + \frac{\delta|T|}{24|D|} < b(x^*,u,j) + \frac{\delta|T|}{24|D|} - \frac{\delta|T|}{6|D|}
\le b(x^{(1)},u,j) + 2\,\frac{\delta|T|}{24|D|} - \frac{\delta|T|}{6|D|} = b(x^{(1)},u,j) - \frac{\delta|T|}{12|D|}$$

hence $u \in C$. Therefore, no obvious vertices in $C_i$ are in $T'$ so $C_i$ is finished w.r.t. $T'$. The existence
of $v$ implies $C_i$ is not finished w.r.t. $T$, so $C_i$ is newly finished. To complete the proof note that
$T' \subseteq T$ so finished is a monotonic property. □
Lemma 32. $(T', y')$ satisfy the invariant $v \in V \setminus T' \Rightarrow y'_v = x^*_v$.

Proof. Fix $v \in V \setminus T'$. If $v \notin T$ the conclusion follows from the invariant for $(T, y)$. If $v \in T \setminus T' = C$
we need to show $y'_v = x^*_v$.

Let $i = y'_v$. For any $j \ne i$, use Lemma [29] and the definition of $C$ to obtain

$$b(x^*,v,i) \le b(x^{(1)},v,i) + \frac{\delta|T|}{24|D|} < b(x^{(1)},v,j) - \frac{\delta|T|}{12|D|} + \frac{\delta|T|}{24|D|} \le b(x^*,v,j) + 2\,\frac{\delta|T|}{24|D|} - \frac{\delta|T|}{12|D|} = b(x^*,v,j)$$

so by optimality of $x^*$ we have the Lemma. □
Lemmas [31] and [32] complete the inductive proof of Lemma [26].
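
The constants used across Lemmas 29, 31 and 32 fit together exactly, which can be verified with exact rational arithmetic. The snippet below is only a sanity check written for this purpose; every quantity is expressed as a coefficient of $\delta|T|/|D|$, and the variable names are illustrative rather than taken from the paper.

```python
from fractions import Fraction as F

# All values are coefficients of delta*|T|/|D|.

# Lemma 29: three classes of corrupted variables, each bounded by 1/72.
per_class = F(1, 72)
lemma29_error = 3 * per_class

# Class 3 of Lemma 29: Lemma 28 with c = delta|T|/(72|D|^2) gives the factor
# 6*72, and OPT is below the termination threshold with factor 1/(6*72^2).
class3 = F(6 * 72) * F(1, 6 * 72**2)

# Lemma 31: an obvious vertex in a cluster of size >= |T|/(2|D|) is
# (1/3)*(1/2) = 1/6 clear; two applications of Lemma 29 leave this margin.
clear_gap = F(1, 3) * F(1, 2)
margin = clear_gap - 2 * lemma29_error

# Lemma 32: the margin used to define C exactly pays for two applications
# of Lemma 29, leaving zero slack.
lemma32_slack = margin - 2 * lemma29_error
```

The zero slack in the last line reflects that the margin $\delta|T|/(12|D|)$ in the definition of $C$ is exactly twice the error bound of Lemma 29.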